For health outcomes data, we are interested in the incidence of asthma, skin cancer and lung cancer across time and space. We obtained age-adjusted incidence rates of asthma related hospitalizations, age-adjusted incidence rates of Melanoma of the skin and age-adjusted incidence rates of lung cancer. The rates were per 100,000 people for the three U.S. states New York, Ohio and Pennsylvania over multiple years. The following bullet points provide links from where we obtained the data.
We also obtained age-adjusted incidence rates data for these three health outcomes for the U.S. states New York, Ohio and Pennsylvania at the county-level at fixed time points. The following bullet points provide links from where we obtained the data. The asthma data is provided for the year 2016 and the skin cancer and lung cancer data is averaged over the years 2014-2018.
Each of these data sets were exported and each imported into R. For each of the three health outcomes, the data sets were merged across the three states for the longitudinal data and for the cross-sectional data. This effort resulted in six data sets. The data import and cleaning of the original data for asthma, melanoma of the skin and lung cancer can be found in our github repository.
We will now import the six data sets and merge all longitudinal data together and merge all cross-sectional data together to have two data sets.
First, let’s combine the longitudinal lung and skin cancer data at the state level.
# read in longitudinal state data for lung and skin cancer
# longitudinal state level lung cancer data
state_lc_data = read_csv("./data/lung/state_lc_data.csv")
# longitudinal state level melanoma data
state_mel_data = read_csv("./data/incidence_melanoma_data/state_melanoma_age_adjusted.csv") %>%
rename(c("outcome" = "cancer_type"))
# combine cancer outcome data
lc_mel_state = bind_rows(state_lc_data, state_mel_data) %>%
select(state, year, outcome, age_adjusted_incidence_rate)
Next, let’s combine the longitudinal asthma data with the longitudinal cancer data at the state level.
We begin by reading in the longitudinal state-level asthma data.
# Read in longitudinal state-level asthma data
# Rename and select columns
asthma_state <- read_csv(here::here("data", "asthma_state.csv")) %>%
rename(age_adjusted_incidence_rate = rate) %>%
select(state, year, outcome, age_adjusted_incidence_rate)
Next, we can combine the longitudinal asthma and the longitudinal cancer state-level data.
# Bind rows of longitudinal asthma and longitudinal cancer by state and arrange
# accordingly
lc_mel_asthma_state <- bind_rows(lc_mel_state, asthma_state) %>%
arrange(state, year, outcome)
The resulting data set has 234 rows and 4 columns. There are 3 states: New York, Ohio and Pennsylvania. Years span from 1976 to 2019. There are 234 non-missing age-adjusted incidence rates. So, no missing data.
The columns in this data set are:
state: The U.S. state.
year: The year.
outcome: The health outcome.
age_adjusted_incidence_rate: The age-adjusted incidence rate for the health outcome per 100,000.
Lastly, let’s write out the combined data.
write_csv(lc_mel_asthma_state, here::here("data", "lc_mel_asthma_state.csv"))
First, let’s combine the cross-sectional lung and skin cancer data at the county level. We will add a column here for the county FIPS code for mapping purposes.
# read in cross-sectional county data for lung and skin cancer
county_lc = read_csv("./data/lung/county_lc.csv")
county_mel = read_csv("./data/incidence_melanoma_data/state_county_melanoma_incidence_2014_2018.csv")
fips_codes = read_csv("./data/fips_codes.csv")
# cleaning to make the data sets compatible
county_mel = county_mel %>%
rename(c("outcome" = "cancer_type")) %>%
drop_na()
county_mel <- county_mel[!(county_mel$county == "Ohio" | county_mel$county == "Pennsylvania" |
county_mel$county == "New York"), ] %>%
filter(!grepl('SEER', county))
county_mel$county <- gsub(" County","", county_mel$county)
county_mel$county[county_mel$county == "St Lawrence"] <- "St. Lawrence"
county_lc$county <- gsub(" County","", county_lc$county)
# binding rows
lc_mel_county = bind_rows(county_lc, county_mel) %>%
filter(complete.cases(.))
# Add FIPS column to the data
fips = fips_codes %>%
janitor::clean_names() %>%
rename(c("county" = "name"))
# fips df has 'St Lawrence' instead of 'St. Lawrence'
fips$county[fips$county == "St Lawrence"] <- "St. Lawrence"
lc_mel_county = merge(lc_mel_county, fips, by = c("state", "county"))
Next, we will read in the cross-sectional asthma county data and combine it with the cross-sectional cancer data.
We begin by reading in the cross-sectional asthma county data, renaming column names and adding a column for county FIPS code.
# Read in cross-sectional asthma data
# Only get county data for the year 2016 since we need county data at
# a fixed time point (i.e. not longitudinal)
# Adjust column names and types in prep for merge
asthma_county <- read_csv(here::here("data", "asthma_county.csv")) %>%
filter(year == 2016) %>%
select(-year) %>%
rename(age_adjusted_incidence_rate = rate) %>%
mutate(age_adjusted_incidence_rate = as.double(age_adjusted_incidence_rate))
# asthma_county has 'New York County' instead of 'New York' in fips df, and "GAllia" instead of "Gallia"
asthma_county$county[asthma_county$county == "New York County"] <- "New York"
asthma_county$county[asthma_county$county == "GAllia"] <- "Gallia"
asthma_county <- merge(asthma_county, fips, by = c("state", "county"))
Next, let’s combine the cross-sectional asthma and cancer county data.
# Bind rows of cancer and asthma by county and arrange accordingly
lc_mel_asthma_county <- bind_rows(lc_mel_county, asthma_county) %>%
arrange(state, county, outcome)
The resulting data set has 647 rows and 5 columns. There are 3 states: New York, Ohio and Pennsylvania.
There are 624 non-missing age-adjusted incidence rates.
The columns in this data set are:
state: The U.S. state.
county: The county in the state.
outcome: The health outcome.
age_adjusted_incidence_rate: The age-adjusted incidence rate for the health outcome per 100,000.
fips: The state-county FIPS code.
The following table displays the number of counties for which we have non-missing age-adjusted incidence rates in each state and the total number of counties in each state
lc_mel_asthma_county %>%
group_by(state) %>%
summarize(
non_missing_county = sum(!is.na(age_adjusted_incidence_rate)),
total_counties = n()
) %>%
knitr::kable(col.names = c("State", "Non-Missing County", "Total County"))
| State | Non-Missing County | Total County |
|---|---|---|
| NY | 185 | 185 |
| OH | 264 | 264 |
| PA | 175 | 198 |
From this table, we see that we are missing 23 age-adjusted incidence rates from Pennsylvania. We have all age-adjusted incidence rates for New York and Ohio.
Lastly, let’s write out the combined data.
write_csv(lc_mel_asthma_county, here::here("data", "lc_mel_asthma_county.csv"))
Now, that we have combined the health outcomes data, we can explore.
We can define a function to generate cross-sectional maps for a given health outcome at the county-level.
# Purpose: Generates a map for the given county-level health outcome data,
# outcome and plot title.
# Arguments: df: The data frame, the county-level health outcome data.
# outcome_var: a character, the health outcome of interest.
# plot_title: a character, the plot title.
# Returns: The plotly map.
# lc_mel_asthma_county has 'St. Lawrence' instead of 'St Lawrence' the counties list will need to map
lc_mel_asthma_county$county[lc_mel_asthma_county$county == "St. Lawrence"] <- "St Lawrence"
map_by_outcome <- function(df, outcome_var, plot_title) {
url <- 'https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json'
counties <- rjson::fromJSON(file = url)
df %>%
filter(outcome == outcome_var) %>%
plot_ly() %>%
add_trace(
type = "choroplethmapbox",
geojson = counties,
locations = ~fips,
z = ~age_adjusted_incidence_rate,
text = ~county,
colorscale = "Viridis",
reversescale = TRUE,
marker = list(line = list(width = 0),
opacity = 0.5)
) %>%
colorbar(title = "Age-Adjusted Incidence Rate") %>%
layout(
title = plot_title,
mapbox = list(
style = "carto-positron",
zoom = 4,
center = list(lon = -77.215135, lat = 41.164818)
)
)
}
We can also define a function to generate longitudinal line graphs of the age-adjusted incidence rate of the specified health outcome by year for all three states.
# Purpose: Generate longitudinal line graphs of the age-adjusted incidence rate
# of the specified health outcome by year for all three states.
# Arguments: df: The data frame, the longitudinal state-level health outcome
# data.
# outcome_var: a character, the health outcome of interest.
# plot_title: a character, the plot title.
# Returns: The plotly map.
aair_by_year_by_outcome <- function(df, outcome_var, plot_title) {
df %>%
filter(outcome == outcome_var) %>%
pivot_wider(names_from = "state",
values_from = "age_adjusted_incidence_rate") %>%
plot_ly(x = ~year) %>%
add_lines(y = ~NY, name = "New York") %>%
add_lines(y = ~OH, name = "Ohio") %>%
add_lines(y = ~PA, name = "Pennsylvania") %>%
layout(
title = plot_title,
xaxis = list(
rangeselector = list(buttons = list(list(count = 1,
label = "1 yr",
step = "year",
stepmode = "backward"),
list(count = 5,
label = "5 yr",
step = "year",
stepmode = "backward"),
list(count = 10,
label = "10 yr",
step = "year",
stepmode = "backward"),
list(step = "all"))),
rangeslider = list(type = "year"),
title = "Year"),
yaxis = list(title = "Age-Adjusted Incidence Rate per 100,000")
)
}
Now, let’s plot the cross-sectional map for each health outcome at the county-level followed by the longitudinal line graph of the age-adjusted incidence rate of the health outcome by year for all three states.
Asthma:
map_by_outcome(lc_mel_asthma_county,
"asthma",
"Asthma Age-Adjusted Incidence Rates (2016)")
aair_by_year_by_outcome(lc_mel_asthma_state,
"asthma",
"Asthma Age-Adjusted Incidence Rates")
Melanoma of the skin:
map_by_outcome(lc_mel_asthma_county,
"melanoma",
"Melanoma of the Skin Age-Adjusted Incidence Rates (2014-2018)")
aair_by_year_by_outcome(lc_mel_asthma_state,
"melanoma",
"Melanoma of the Skin Age-Adjusted Incidence Rates")
Lung Cancer:
map_by_outcome(lc_mel_asthma_county,
"lung cancer",
"Lung Cancer Age-Adjusted Incidence Rates (2014-2018)")